<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
<!-- Copyright 1997 The Open Group, All Rights Reserved -->
<title>lex</title>
</head><body bgcolor=white>
<center>
<font size=2>
The Single UNIX &reg; Specification, Version 2<br>
Copyright &copy; 1997 The Open Group

</font></center><hr size=2 noshade>
<h4><a name = "tag_001_014_1080">&nbsp;</a>NAME</h4><blockquote>
lex - generate programs for lexical tasks (<b>DEVELOPMENT</b>)
</blockquote><h4><a name = "tag_001_014_1081">&nbsp;</a>SYNOPSIS</h4><blockquote>
<pre><code>

lex -c<b>[</b>-t<b>][ </b>-n| -v<b>][</b><i>file</i>...<b>]</b>
</code>
</pre>
</blockquote><h4><a name = "tag_001_014_1082">&nbsp;</a>DESCRIPTION</h4><blockquote>
The
<i>lex</i>
utility generates C programs to be used in lexical processing of character
input, and that can be used as an interface to
<i><a href="yacc.html">yacc</a></i>.
The C programs are generated from
<i>lex</i>
source code and conform to the ISO&nbsp;C standard.
Usually, the
<i>lex</i>
utility writes the program it generates to the file
<b>lex.yy.c</b>;
the state of this file is unspecified if
<i>lex</i>
exits with a non-zero exit status.
See the EXTENDED DESCRIPTION section for a complete description of the
<i>lex</i>
input language.
</blockquote><h4><a name = "tag_001_014_1083">&nbsp;</a>OPTIONS</h4><blockquote>
The
<i>lex</i>
utility supports the <b>XBD</b> specification, <a href="../xbd/utilconv.html#usg"><b>Utility Syntax Guidelines</b></a>.
<p>
The following options are supported:
<dl compact>

<dt><b>-c</b>
<dd>Indicate C-language action (default option).

<dt><b>-n</b>
<dd>Suppress the summary of statistics usually written with the
<b>-v</b>
option.
If no table sizes are specified in the
<i>lex</i>
source code and the
<b>-v</b>
option is not specified, then
<b>-n</b>
is implied.

<dt><b>-t</b>
<dd>Write the resulting program to standard output instead of
<b>lex.yy.c</b>.

<dt><b>-v</b>
<dd>Write a summary of
<i>lex</i>
statistics to the standard output.
(See the discussion of
<i>lex</i>
table sizes in
<xref href=lexdefs><a href="#tag_001_014_1092_001">
Definitions in lex
</a></xref>.)
If the
<b>-t</b>
option is specified and
<b>-n</b>
is not specified, this report will be written to standard error.
If table sizes are specified in the
<i>lex</i>
source code,
and if the
<b>-n</b>
option is not specified, the
<b>-v</b>
option may be enabled.

</dl>
</blockquote><h4><a name = "tag_001_014_1084">&nbsp;</a>OPERANDS</h4><blockquote>
The following operand is supported:
<dl compact>

<dt><i>file</i><dd>A pathname of an input file.
If more than one such
<i>file</i>
is specified, all files will be
concatenated to produce a single
<i>lex</i>
program.
If no
<i>file</i>
operands are specified,
or if a
<i>file</i>
operand is "-", the standard input will be used.

</dl>
</blockquote><h4><a name = "tag_001_014_1085">&nbsp;</a>STDIN</h4><blockquote>
The standard input will be used if no
<i>file</i>
operands are specified,
or if a
<i>file</i>
operand is "-".
See
<b>INPUT FILES</b>.
</blockquote><h4><a name = "tag_001_014_1086">&nbsp;</a>INPUT FILES</h4><blockquote>
The input files must be text files containing
<i>lex</i>
source code, as described in the EXTENDED DESCRIPTION section.
</blockquote><h4><a name = "tag_001_014_1087">&nbsp;</a>ENVIRONMENT VARIABLES</h4><blockquote>
The following environment variables affect the execution of
<i>lex</i>:
<dl compact>

<dt><i>LANG</i><dd>Provide a default value for the internationalisation variables
that are unset or null.
If
<i>LANG</i>
is unset or null, the corresponding value from the
implementation-dependent default locale will be used.
If any of the internationalisation variables contains an invalid setting, the
utility will behave as if none of the variables had been defined.

<dt><i>LC_ALL</i><dd>
If set to a non-empty string value,
override the values of all the other internationalisation variables.

<dt><i>LC_COLLATE</i><dd>
Determine the locale for the
behaviour of ranges, equivalence classes
and multi-character collating elements
within regular expressions.
If this variable is not set to the POSIX locale, the results are unspecified.

<dt><i>LC_CTYPE</i><dd>
Determine the
locale for the interpretation of sequences of bytes of text data as
characters (for example, single- as opposed to multi-byte characters
in arguments and input files), and
the behaviour of character classes within regular expressions.
If this variable is not set to the POSIX locale, the results are unspecified.

<dt><i>LC_MESSAGES</i><dd>
Determine the locale that should be used to affect
the format and contents of diagnostic
messages written to standard error.

<dt><i>NLSPATH</i><dd>
Determine the location of message catalogues
for the processing of
<i>LC_MESSAGES</i>.
</dl>
</blockquote><h4><a name = "tag_001_014_1088">&nbsp;</a>ASYNCHRONOUS EVENTS</h4><blockquote>
Default.
</blockquote><h4><a name = "tag_001_014_1089">&nbsp;</a>STDOUT</h4><blockquote>
If the
<b>-t</b>
option is specified,
the text file of C source code output of
<i>lex</i>
will be written to standard output.
<p>
If the
<b>-t</b>
option is not specified:
<ol>
<p>
<li>
Implementation-dependent informational,
error and warning messages concerning the contents of
<i>lex</i>
source code input
will be written to either the standard output or standard error.
<p>
<li>
If the
<b>-v</b>
option is specified
and the
<b>-n</b>
option is not specified,
<i>lex</i>
statistics
will also be written to either the standard output or standard error,
in an implementation-dependent format.
These statistics may also be generated if
table sizes are specified with a
"%"
operator in the
<i>Definitions</i>
section (see the EXTENDED DESCRIPTION section),
as long as the
<b>-n</b>
option is not specified.
<p>
</ol>
</blockquote><h4><a name = "tag_001_014_1090">&nbsp;</a>STDERR</h4><blockquote>
If the
<b>-t</b>
option is specified,
implementation-dependent informational,
error and warning messages concerning the contents of
<i>lex</i>
source code input will be written to the standard error.
<p>
If the
<b>-t</b>
option is not specified:
<ol>
<p>
<li>
Implementation-dependent informational,
error and warning messages concerning the contents of
<i>lex</i>
source code input
will be written to either the standard output or standard error.
<p>
<li>
If the
<b>-v</b>
option is specified
and the
<b>-n</b>
option is not specified,
<i>lex</i>
statistics
will also be written to either the standard output or standard error,
in an implementation-dependent format.
These statistics may also be generated if
table sizes are specified with a
"%"
operator in the
<i>Definitions</i>
section (see the EXTENDED DESCRIPTION section),
as long as the
<b>-n</b>
option is not specified.
<p>
</ol>
</blockquote><h4><a name = "tag_001_014_1091">&nbsp;</a>OUTPUT FILES</h4><blockquote>
A text file containing C source code will be written to
<b>lex.yy.c</b>,
or to the standard output if the
<b>-t</b>
option is present.
<br>
</blockquote><h4><a name = "tag_001_014_1092">&nbsp;</a>EXTENDED DESCRIPTION</h4><blockquote>
Each input file contains
<i>lex</i>
source code,
which is a table of regular expressions with corresponding
actions in the form of
C program fragments.
<p>
When
<b>lex.yy.c</b>
is compiled and linked with the
<i>lex</i>
library (using
the <b>-l&nbsp;l</b> operand with
<i><a href="c89.html">c89</a></i>
or
<i><a href="cc.html">cc</a></i>),
the resulting program reads character
input from the standard input and partitions it into
strings that match the given expressions.
<p>
When an expression is matched,
these actions will occur:
<ul>
<p>
<li>
The input string that was matched is left in
<i>yytext</i>
as a null-terminated string;
<i>yytext</i>
is either an external character array
or a pointer to a character string.
As explained in
<xref href=lexdefs><a href="#tag_001_014_1092_001">
Definitions in lex
</a></xref>,
the type can be explicitly selected using the
<b>%array</b>
or
<b>%pointer</b>
declarations, but the default is implementation-dependent.
<p>
<li>
The external
<b>int</b>
<i>yyleng</i>
is set to the length of the matching string.
<p>
<li>
The expression's corresponding
program fragment, or action, is executed.
<p>
</ul>
<p>
During pattern matching,
<i>lex</i>
searches the set of patterns
for the single longest possible match.
Among rules that match
the same number of characters, the rule given first
will be chosen.
<p>
The general format of
<i>lex</i>
source is:
<pre>
<dl compact><dt> <dd>
<i>Definitions</i>
<b>%%</b>
<i>Rules</i>
<b>%%</b>
<i>User&nbsp;Subroutines</i>
</dl>
</pre>
<p>
The first
<b>%%</b>
is required to mark the beginning of the rules
(regular expressions and actions);
the second
<b>%%</b>
is required only if user subroutines follow.
<p>
Any line in the
<i>Definitions</i>
section beginning with a
blank character
will be assumed to be a C program
fragment and will be copied to the external definition area of the
<b>lex.yy.c</b>
file.
Similarly, anything in the
<i>Definitions</i>
section included between delimiter lines containing only
<b>%{</b>
and
<b>%}</b>
will also be copied unchanged to the external
definition area of the
<b>lex.yy.c</b>
file.
<p>
Any such input (beginning with a
blank character
or within
<b>%{</b>
and
<b>%}</b>
delimiter lines) appearing at the beginning of the
<i>Rules</i>
section before any rules are specified will be written to
<b>lex.yy.c</b>
after the declarations of variables for the
<i>yylex()</i>
function and before the first line of code in
<i>yylex()</i>.
Thus, user variables local to
<i>yylex()</i>
can be declared here,
as well as application code to execute upon entry to
<i>yylex()</i>.
<p>
The action taken by
<i>lex</i>
when encountering any input
beginning with a
blank character
or within
<b>%{</b>
and
<b>%}</b>
delimiter lines appearing in the
<i>Rules</i>
section but coming after one or
more rules is undefined.
The presence of such input may
result in an erroneous definition of the
<i>yylex()</i>
function.
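<p>
As a hedged illustration of the layout described above, the following sketch
shows where each kind of C fragment is placed.
The counter names
<i>total_words</i>
and
<i>words_on_line</i>
are hypothetical and are not part of
<i>lex</i>.
<pre>
<code>
%{
/* Definitions section: copied to the external definition area. */
#include &lt;stdio.h&gt;
static int total_words = 0;     /* hypothetical counter */
%}

%%

%{
    /* Before the first rule: local to yylex(), run on each entry. */
    int words_on_line = 0;      /* hypothetical counter */
%}

[A-Za-z]+    { ++words_on_line; ++total_words; }
\n           { printf("%d\n", words_on_line); words_on_line = 0; }
.            ;

%%

/* User subroutines section: copied after yylex(). */
int main(void)
{
    yylex();
    printf("total: %d\n", total_words);
    return 0;
}
</code>
</pre>
<p>
Because this sketch supplies its own
<i>main()</i>,
only
<i>yywrap()</i>
would be taken from the
<i>lex</i>
library when the generated file is compiled with the
<b>-l&nbsp;l</b>
operand.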
<h5><a name = "tag_001_014_1092_001">&nbsp;</a>Definitions in lex</h5>
<xref type="5" name="lexdefs"></xref>
<i>Definitions</i>
appear before the first
<b>%%</b>
delimiter.
Any line in this section not contained between
<b>%{</b>
and
<b>%}</b>
lines and not beginning with a
blank character
is assumed to define a
<i>lex</i>
substitution string.
The format of these lines is:
<pre>
<dl compact><dt> <dd>
<i>name substitute</i>
</dl>
</pre>
<br>
If a
<i>name</i>
does not meet the requirements for identifiers in the ISO&nbsp;C standard,
the result is undefined.
The string
<i>substitute</i>
will replace the string
{<i>name</i>}
when it is used in a rule.
The
<i>name</i>
string is recognised in this context only when the braces are provided
and when it does not appear within a bracket expression
or within double-quotes.
<p>
In the
<i>Definitions</i>
section, any line beginning with a
"%"
(percent sign) character and followed by an alphanumeric
word beginning with either
s
or
S
defines a set of start conditions.
Any line beginning with a
"%"
followed by a word beginning with either
x
or
X
defines a set of exclusive start conditions.
When the generated scanner is in a
<b>%s</b>
state,
patterns with no state specified will also be active;
in a
<b>%x</b>
state, such patterns will not be active.
The rest of the line, after
the first word, is considered to be one or more
blank-character-separated
names of start conditions.
Start condition names
are constructed in the same way as definition names.
Start conditions can be used to restrict
the matching of regular expressions
to one or more states as described
in
<xref href=lexre><a href="#tag_001_014_1092_004">
Regular Expressions in lex
</a></xref>.
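<p>
A minimal sketch of the two kinds of declaration follows.
The start condition names
<b>SHOUT</b>
and
<b>COMMENT</b>,
and the
<b>!on</b>
and
<b>!off</b>
markers, are assumptions chosen for illustration.
While the scanner is in the inclusive
<b>SHOUT</b>
state, the rules with no state prefix remain active;
while it is in the exclusive
<b>COMMENT</b>
state, only the rules prefixed with
&lt;COMMENT&gt;
are active.
<pre>
<code>
%{
#include &lt;ctype.h&gt;
#include &lt;stdio.h&gt;
%}

%s SHOUT
%x COMMENT

%%

"/*"             BEGIN COMMENT;
&lt;COMMENT&gt;"*/"    BEGIN INITIAL;
&lt;COMMENT&gt;.|\n    ;
"!on"            BEGIN SHOUT;
"!off"           BEGIN INITIAL;
&lt;SHOUT&gt;[a-z]     putchar(toupper((unsigned char)yytext[0]));
</code>
</pre>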
<p>
Implementations accept either of the following two
mutually exclusive declarations in the
<i>Definitions</i>
section:
<dl compact>

<dt><b>%array</b><dd>Declare the type of
<i>yytext</i>
to be a null-terminated character array.

<dt><b>%pointer</b><dd>Declare the type of
<i>yytext</i>
to be a pointer to a null-terminated character string.

</dl>
<p>
The default type of
<i>yytext</i>
is implementation-dependent.
If an application refers to
<i>yytext</i>
outside of the scanner source file (that is, via an
<b>extern</b>),
the application will include the appropriate
<b>%array</b>
or
<b>%pointer</b>
declaration in the scanner source file.
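<p>
For example, a scanner whose
<i>yytext</i>
is read from a separately compiled file might, as a sketch, fix the type as
follows.
The token value 1 is a hypothetical placeholder.
<pre>
<code>
%pointer

%%

[a-z]+    return 1;   /* hypothetical token value */

%%

/*
 * A separately compiled file would then declare:
 *     extern char *yytext;
 * With %array instead, the matching declaration would be:
 *     extern char yytext[];
 */
</code>
</pre>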
<p>
Implementations will accept declarations in the
<i>Definitions</i>
section for setting certain internal table sizes.
The declarations are shown in the following table.
<pre>
<table  bordercolor=#000000 border=1 align=center><tr valign=top><th align=center><b>Declaration</b>
<th align=center><b>Description</b>
<th align=center><b>Minimum Value</b>
<tr valign=top><td align=left><b>%p </b><i>n</i>
<td align=left>Number of positions
<td align=left>2500
<tr valign=top><td align=left><b>%n </b><i>n</i>
<td align=left>Number of states
<td align=left>500
<tr valign=top><td align=left><b>%a </b><i>n</i>
<td align=left>Number of transitions
<td align=left>2000
<tr valign=top><td align=left><b>%e </b><i>n</i>
<td align=left>Number of parse tree nodes
<td align=left>1000
<tr valign=top><td align=left><b>%k </b><i>n</i>
<td align=left>Number of packed character classes
<td align=left>1000
<tr valign=top><td align=left><b>%o </b><i>n</i>
<td align=left>Size of the output array
<td align=left>3000
</table>
</pre>
<h6 align=center><xref table="Table Size Declarations in <I>lex</i>"></xref>Table: Table Size Declarations in <i>lex</i></h6>
<p>
In the table,
<i>n</i>
represents a positive decimal integer, preceded by one or more
blank characters.
The exact meaning of these table size numbers is implementation-dependent.
The implementation will document how these numbers affect the
<i>lex</i>
utility and how they are
related to any output that may be generated by the implementation should
space limitations be encountered during the execution of
<i>lex</i>.
It is possible to determine from this output
which of the table size values needs to
be modified to permit
<i>lex</i>
to successfully generate tables for the input language.
The values in the column Minimum Value represent the
lowest values conforming implementations will provide.
<h5><a name = "tag_001_014_1092_002">&nbsp;</a>Rules in lex</h5>
The rules in
<i>lex</i>
source files are a table in which
the left column contains regular expressions
and the right column contains actions (C program fragments)
to be executed when the expressions are recognised.
<pre>
<dl compact><dt> <dd>
<i>
ERE action
ERE action
  ...
</i>
</dl>
</pre>
<p>
The extended regular expression (<i>ERE</i>)
portion of a row will be separated from
<i>action</i>
by one or more blank characters.
A regular expression containing blank characters
is recognised under one of the following conditions:
<ul>
<p>
<li>
The entire expression appears within double-quotes.
<p>
<li>
The blank characters appear within double-quotes or square brackets.
<p>
<li>
Each blank character is preceded by a backslash character.
<p>
</ul>
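<p>
The conditions above can be sketched as four rules, each matching a
three-word phrase.
The phrases and messages are hypothetical.
<pre>
<code>
%{
#include &lt;stdio.h&gt;
%}

%%

"end of file"        printf("entire expression quoted\n");
end" "of" "line      printf("blanks quoted\n");
end[ ]of[ ]record    printf("blanks in brackets\n");
end\ of\ input       printf("blanks escaped\n");
</code>
</pre>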
<h5><a name = "tag_001_014_1092_003">&nbsp;</a>User Subroutines in lex</h5>
Anything in the user subroutines section will be copied to
<b>lex.yy.c</b>
following
<i>yylex()</i>.
<h5><a name = "tag_001_014_1092_004">&nbsp;</a>Regular Expressions in lex</h5>
<xref type="5" name="lexre"></xref>
The
<i>lex</i>
utility supports the set of extended regular expressions (see
the <b>XBD</b> specification, <a href="../xbd/re.html#tag_007_004"><b>Extended Regular Expressions</b></a>),
with the following additions and exceptions to the syntax:
<dl compact>

<dt><b>"..."</b><dd>Any string enclosed in double-quotes will represent the characters
within the double-quotes as themselves, except that backslash escapes
(which appear in the following table)
are recognised.
Any backslash-escape sequence is terminated by the closing quote.
For example,
"\01""1"
represents a single string:
the octal value 1 followed by the character 1.

<dt>&lt;<i>state</i>&gt;<i>r</i><dd>
<dt>&lt;<i>state1</i><b>,</b><i>state2</i><b>,</b>...&gt;<i>r</i><dd>
The regular expression <i>r</i> will be matched only
when the program is in one of the start conditions indicated by
<i>state</i>,
<i>state1</i>
and so on;
see
<xref href=lexacts><a href="#tag_001_014_1092_005">
Actions in lex
</a></xref>.
(As an exception to the typographical conventions
of the rest of this specification, in this case
&lt;<i>state</i>&gt; does not represent a metavariable,
but the literal angle-bracket characters surrounding a symbol.)
The start condition is recognised as such only
at the beginning of a regular expression.

<dt><i>r</i>/<i>x</i><dd>The regular expression <i>r</i> will be matched only if
it is followed by an occurrence of regular expression <i>x</i>.
The token returned in
<i>yytext</i>
will only match <i>r</i>.
If the trailing portion of <i>r</i> matches the
beginning of <i>x</i>, the result is unspecified.
The
<i>r</i>
expression cannot include further trailing context or the "$"
(match-end-of-line) operator;
<i>x</i>
cannot include the "^"
(match-beginning-of-line) operator, nor trailing context,
nor the "$" operator.
That is, only one occurrence of
trailing context is allowed in a
<i>lex</i>
regular expression, and the "^"
operator can be used only at the beginning of such an expression.


<dt><b>{</b><i>name</i><b>}</b><dd>When
<i>name</i>
is one of the substitution symbols from the <i>Definitions</i> section,
the string, including the enclosing braces, will be replaced by the
<i>substitute</i>
value.
The
<i>substitute</i>
value will be treated in the extended regular expression
as if it were enclosed in parentheses.
No substitution will occur if <b>{</b><i>name</i><b>}</b> occurs
within a bracket expression
or within double-quotes.

</dl>
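<p>
The effect of the implied parentheses can be seen in the following sketch,
in which the definition names
<b>SIGN</b>
and
<b>DIGIT</b>
are hypothetical.
The first rule behaves as
("+"|"-")?([0-9])+,
so the
"?"
applies to the whole of the
<b>SIGN</b>
alternation rather than only to its last alternative.
In the second rule the braces appear within double-quotes, so no
substitution occurs.
<pre>
<code>
%{
#include &lt;stdio.h&gt;
%}

SIGN     "+"|"-"
DIGIT    [0-9]

%%

{SIGN}?{DIGIT}+    printf("integer: %s\n", yytext);
"{DIGIT}"          printf("the literal text {DIGIT}\n");
</code>
</pre>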
<p>
Within an ERE, a backslash
character is considered to begin
an escape sequence as specified in the table in
the <b>XBD</b> specification, <a href="../xbd/notation.html"><b>File Format Notation</b>&nbsp;</a> 
(\\,
\a,
\b,
\f,
\n,
\r,
\t,
\v).
In addition, the escape sequences in the following table
will be recognised.
<p>
A literal
newline
character cannot occur within an ERE;
the escape sequence
\n
can be used to represent a
newline character.
A
newline character
cannot be matched by a period operator.
<pre>
<table  bordercolor=#000000 border=1 align=center>
<tr valign=top><th align=center><b>Escape<br>Sequence</b>
<th align=center><b>Description</b>
<th align=center><b>Meaning</b>
<tr valign=top><td align=left>\<i>digits</i>
<td align=left> A backslash character followed by the longest sequence of one, two or three octal-digit characters (01234567). If all of the digits are 0 (that is, representation of the NUL character), the behaviour is undefined. 
<td align=left> The character whose encoding is represented by the one-, two- or three-digit octal integer. If the size of a byte on the system is greater than nine bits, the valid escape sequence used to represent a byte is implementation-dependent. Multi-byte characters require multiple, concatenated escape sequences of this type, including the leading \ for each byte. 
<tr valign=top><td align=left>\<b>x</b><i>digits</i>
<td align=left> A backslash character followed by the longest sequence of hexadecimal-digit characters (01234567abcdefABCDEF). If all of the digits are 0 (that is, representation of the NUL character), the behaviour is undefined. 
<td align=left> The character whose encoding is represented by the hexadecimal integer. 
<tr valign=top><td align=left>\<i>c</i>
<td align=left> A backslash character followed by any character not described in this table or in the table in the <b>XBD</b> specification, <a href="../xbd/notation.html"><b>File Format Notation</b>&nbsp;</a>  (\\, \a, \b, \f, \n, \r, \t, \v).
<td align=left>The character <i>c</i>, unchanged.
</table>
</pre>
<h6 align=center><xref table="Escape Sequences in <I>lex</i>"></xref>Table: Escape Sequences in <i>lex</i></h6>
<p>
The order of precedence given to extended regular expressions for
<i>lex</i>
differs from that specified in
the <b>XBD</b> specification, <a href="../xbd/re.html#tag_007_004"><b>Extended Regular Expressions</b></a>.
The order of precedence for
<i>lex</i>
is as shown in the following table,
from high to low.
<dl><dt><b>Note:</b>
<dd>The escaped characters entry
is not meant to
imply that these are operators, but they are
included in the table to show their relationships
to the true operators.
The start condition, trailing context
and anchoring notations have been omitted
from the table because of the placement restrictions
described in this section; they can only appear
at the beginning or ending of an ERE.
</dl>
<pre>
<table  bordercolor=#000000 border=1 align=center><tr valign=top><th align=center><b>Extended Regular Expression</b>
<th align=center><b>Precedence</b>
<tr valign=top><td align=left>collation-related bracket symbols
<td align=left>[= =]  [: :]  [. .]
<tr valign=top><td align=left>escaped characters
<td align=left>\&lt;<i>special character</i>&gt;
<tr valign=top><td align=left>bracket expression
<td align=left>[ ]
<tr valign=top><td align=left>quoting
<td align=left>"..."
<tr valign=top><td align=left>grouping
<td align=left>( )
<tr valign=top><td align=left>definition
<td align=left>{<i>name</i>}
<tr valign=top><td align=left>single-character RE duplication
<td align=left>* + ?
<tr valign=top><td align=left>concatenation
<td align=left>&nbsp;
<tr valign=top><td align=left>interval expression
<td align=left>{<i>m</i>,<i>n</i>}
<tr valign=top><td align=left>alternation
<td align=left>|
</table>
</pre>
<h6 align=center><xref table="ERE Precedence in <I>lex</i>"></xref>Table: ERE Precedence in <i>lex</i></h6>
<p>
The ERE anchoring operators ("^" and "$") do not appear in the table.
With
<i>lex</i>
regular expressions, these operators are restricted in their use:
the "^" operator can only be used at the
beginning of an entire regular expression, and the "$"
operator only at the end.
The operators apply to the entire regular expression.
Thus, for example, the pattern
(^abc)|(def$)
is undefined;
it can instead be written as two
separate rules, one with the regular expression
^abc
and one with
def$,
which share a common action via the special "|"
action (see below).
If the pattern were written
^abc|def$,
it would match either
<b>abc</b>
or
<b>def</b>
on a line by itself.
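<p>
The two-rule form suggested above for the undefined pattern
(^abc)|(def$)
might be sketched as follows; the message text is hypothetical.
<pre>
<code>
%{
#include &lt;stdio.h&gt;
%}

%%

^abc    |
def$    printf("anchored match: %s\n", yytext);
</code>
</pre>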
<p>
Unlike the general ERE rules, embedded anchoring is not allowed
by most historical
<i>lex</i>
implementations.
An example of
embedded anchoring would be for patterns such as
(^|&nbsp;)foo(&nbsp;|$)
to match
<b>foo</b>
when it exists as a complete word.
This functionality can be obtained using existing
<i>lex</i>
features:
<pre>
<code>
^foo/[ \n]      |
" foo"/[ \n]    /* found foo as a separate word */
</code>
</pre>
<p>
Note also that "$"
is a form of trailing context (it is equivalent to
/\n)
and as such cannot be used
with regular expressions containing another instance of the operator (see
the preceding discussion of trailing context).
<p>
The additional regular expressions trailing-context operator "/"
can be used as an ordinary character
if presented within double-quotes, "/"; preceded by a backslash,
\/; or within a bracket expression, [/].
The start-condition "&lt;" and "&gt;"
operators are special only in a start condition
at the beginning of a regular expression;
elsewhere in the regular expression they
are treated as ordinary characters.
<p>
The following examples clarify the differences between
<i>lex</i>
regular expressions and regular expressions appearing elsewhere in
this specification.
For regular expressions of the form
<i>r</i>/<i>x</i>,
the string matching <i>r</i>
is always returned;
confusion may arise when the beginning of <i>x</i>
matches the trailing portion of <i>r</i>.
For example, given the regular expression
a*b/cc
and the input
<b>aaabcc</b>,
<i>yytext</i>
would contain the string
<b>aaab</b>
on this match.
But given the regular expression
x*/xy
and the input
<b>xxxy</b>,
the token
<b>xxx</b>,
not
<b>xx</b>,
is returned
by some implementations because
<b>xxx</b>
matches
x*.
<p>
In the rule
ab*/bc,
the
b*
at the end of
<i>r</i>
will extend
<i>r</i>'s
match into the beginning of the trailing
context, so the result is unspecified.
If this rule were
ab/bc,
however,
the rule
matches the text
<b>ab</b>
when it is followed by the text
<b>bc</b>.
In this latter case, the matching of
<i>r</i>
cannot extend into
the beginning of
<i>x</i>,
so the result is specified.
<h5><a name = "tag_001_014_1092_005">&nbsp;</a>Actions in lex</h5>
<xref type="5" name="lexacts"></xref>
The action to be taken when an
<i>ERE</i>
is matched can be a C program fragment
or the special actions described below;
the program fragment can contain
one or more C statements, and can also include
special actions.
The empty C statement ";" is a valid action; any string in the
<b>lex.yy.c</b>
input that matches the pattern portion of such a rule
is effectively ignored or skipped.
However, the absence of an action is not valid,
and the action
<i>lex</i>
takes in such a condition is undefined.
<p>
The specification for an action,
including C statements and special
actions, can extend across several lines if enclosed in braces:
<pre>
<code>
<i>ERE &lt;one or more blanks&gt;</i> { <i>program statement</i>
                           <i>program statement</i> }
</code>
</pre>
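<p>
As a sketch of both forms of action, the first rule below uses the empty
statement to skip blanks and the second uses a brace-enclosed action that
extends across several lines.
The doubling of the value is a hypothetical piece of processing.
<pre>
<code>
%{
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
%}

%%

[ \t]+    ;
[0-9]+    {
              long value = atol(yytext);   /* hypothetical processing */
              printf("%ld\n", value * 2);
          }
</code>
</pre>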
<p>
The default action when a string in the input to a
<b>lex.yy.c</b>
program is not matched by any expression
is to copy the string to the output.
Because the default behaviour of a program generated by
<i>lex</i>
is to read the input and copy it to the output, a minimal
<i>lex</i>
source program that has just
<b>%%</b>
generates a C
program that simply copies the input to the output unchanged.
<p>
Four special actions are available:
<pre>
<code>
|   ECHO;   REJECT;   BEGIN
</code>
</pre>
<dl compact>

<dt>|<dd>The action "|"
means that the action for the next rule is the action for this rule.
Unlike the other three actions, "|"
cannot be enclosed in braces or be semicolon-terminated;
it must be specified alone, with no other actions.

<dt>ECHO;<dd>Write the contents of the string
<i>yytext</i>
on the output.

<dt>REJECT;<dd>
Usually only a single expression
is matched by a given string in the input.
<b>REJECT</b>
means &quot;continue to the next
expression that matches the current
input&quot;, and causes whatever rule was the
second choice after the current
rule to be executed for the same input.
Thus, multiple rules can be matched and executed for one
input string or overlapping input strings.
For example, given the regular expressions
<b>xyz</b>
and
<b>xy</b>
and the input
<b>xyz</b>,
usually only the regular expression
<b>xyz</b>
would match.
The next attempted match would start after
z.
If the last action in the
<b>xyz</b>
rule is
<b>REJECT</b>,
both this rule and the
<b>xy</b>
rule would be executed.
The
<b>REJECT</b>
action may be implemented in such a fashion that flow of control does not
continue after it, as if it were equivalent to a
<b>goto</b>
to another part of
<i>yylex()</i>.
The use of
<b>REJECT</b>
may result in somewhat
larger and slower scanners.

<dt>BEGIN<dd>The action:
<pre>
<code>
BEGIN <i>newstate</i>;
</code>
</pre>
switches the state (start condition) to
<i>newstate</i>.
If the string
<i>newstate</i>
has not been declared previously
as a start condition in the
<i>Definitions</i>
section, the results are unspecified.
The initial state is indicated by the digit
0
or the token
<b>INITIAL</b>.

</dl>
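<p>
The four special actions might be combined as in the following sketch.
The counters
<i>n_xyz</i>
and
<i>n_xy</i>
and the start condition
<b>SKIP</b>
are hypothetical.
Given the input
<b>xyz</b>,
both counters are incremented, because
<b>REJECT</b>
is the last action in the
<b>xyz</b>
rule.
<pre>
<code>
%{
#include &lt;stdio.h&gt;
static int n_xyz = 0, n_xy = 0;    /* hypothetical counters */
%}

%x SKIP

%%

xyz         { ++n_xyz; REJECT; }
xy          { ++n_xy; ECHO; }
"#"         BEGIN SKIP;
&lt;SKIP&gt;\n    BEGIN INITIAL;
&lt;SKIP&gt;.     ;
" "         |
\t          ;

%%

int main(void)
{
    yylex();
    fprintf(stderr, "xyz=%d xy=%d\n", n_xyz, n_xy);
    return 0;
}
</code>
</pre>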
<p>
The functions or macros described below
are accessible to user code
included in the
<i>lex</i>
input.
It is unspecified whether they appear in
the C code output of
<i>lex</i>,
or are accessible only through the
<b>-l&nbsp;l</b>
operand to
<i><a href="c89.html">c89</a></i>
or
<i><a href="cc.html">cc</a></i>
(the
<i>lex</i>
library).
<dl compact>

<dt>int yylex(void)<dd>
Performs lexical analysis on the input;
this is the primary function generated by the
<i>lex</i>
utility.
The function returns zero when the end of input is reached;
otherwise it returns non-zero values
(tokens) determined by the actions that are selected.

<dt>int yymore(void)<dd>
When called, indicates that
when the next input string is recognised,
it is to be appended to the current value of
<i>yytext</i>
rather than replacing it; the value in
<i>yyleng</i>
is adjusted accordingly.

<dt>int yyless(int <i>n</i>)<dd>
Retains <i>n</i> initial characters in
<i>yytext</i>,
NUL-terminated,
and treats the remaining
characters as if they had not been read; the value in
<i>yyleng</i> is adjusted accordingly.

<dt>int input(void)<dd>
Returns the next character from the input,
or zero on end-of-file.
It obtains input from the stream pointer
<i>yyin</i>,
although possibly via an intermediate buffer.
Thus, once scanning has begun,
the effect of altering the value of
<i>yyin</i>
is undefined.
The character read is removed from the input stream
of the scanner without any processing by the scanner.

<dt>int unput(int <i>c</i>)<dd>
Returns the character <i>c</i> to the input;
<i>yytext</i>
and
<i>yyleng</i>
are undefined until the next expression is matched.
The result of using <i>unput</i> for more characters than have
been input is unspecified.

</dl>
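<p>
A minimal sketch of
<i>yymore()</i>,
<i>yyless()</i>
and
<i>unput()</i>
follows; the patterns and messages are assumptions chosen for illustration.
For the input
<b>hyper-text</b>,
the second rule sees the whole string in
<i>yytext</i>
because the first rule called
<i>yymore()</i>.
In the third rule, the parenthesis is handed back to the input so that it
is scanned again.
<pre>
<code>
%{
#include &lt;stdio.h&gt;
%}

%%

hyper-       yymore();
text         printf("token: %s\n", yytext);
[a-z]+\(     {
                 /* keep the name, return the parenthesis to the input */
                 yyless(yyleng - 1);
                 printf("function name: %s\n", yytext);
             }
\t           unput(' ');    /* rescan a tab as a single blank */
</code>
</pre>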
<p>
The following functions appear only in the
<i>lex</i>
library accessible
through the
<b>-l&nbsp;l</b>
operand; they can therefore be redefined by a
portable application:
<dl compact>

<dt>int yywrap(void)<dd>
Called by
<i>yylex()</i>
at end-of-file; the default
<i>yywrap()</i>
will always return 1.
If the application requires
<i>yylex()</i>
to continue processing with another source of input,
then the application can include a function
<i>yywrap()</i>,
which associates another file with the
external variable
<b>FILE</b>
*<i>yyin</i> and will return a value of zero.

<dt>int main(int <i>argc</i>, char *<i>argv</i>[])<dd>
Calls
<i>yylex()</i>
to perform lexical analysis, then exits.
The user code can contain
<i>main()</i>
to perform application-specific operations, calling
<i>yylex()</i>
as applicable.

</dl>
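<p>
As a sketch of such a redefinition, the following scanner supplies its own
<i>yywrap()</i>
and
<i>main()</i>
so that each pathname operand is scanned in turn.
The variable
<i>pending</i>
is hypothetical, and streams that have been read are left open for brevity.
<pre>
<code>
%{
#include &lt;stdio.h&gt;
static char **pending;    /* hypothetical: pathnames not yet opened */
%}

%%

[ \t]+$    ;    /* drop trailing blanks; all else is copied */

%%

int yywrap(void)
{
    while (*pending != NULL) {
        FILE *fp = fopen(*pending++, "r");
        if (fp != NULL) {
            yyin = fp;
            return 0;     /* more input: yylex() continues */
        }
    }
    return 1;             /* no further input */
}

int main(int argc, char *argv[])
{
    pending = argv + 1;
    if (argc &gt; 1 &amp;&amp; yywrap())
        return 1;         /* operands given, but none could be opened */
    yylex();
    return 0;
}
</code>
</pre>
<p>
When no operands are given, the sketch leaves
<i>yyin</i>
unchanged, so the standard input is scanned.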
<p>
The reason for breaking these functions into two lists
is that only those functions in
<b>libl.a</b>
can be reliably redefined by a portable application.
<p>
Except for
<i>input()</i>,
<i>unput()</i>
and
<i>main()</i>,
all external and static names generated by
<i>lex</i>
begin with the prefix
<b>yy</b>
or
<b>YY</b>.
</blockquote><h4><a name = "tag_001_014_1093">&nbsp;</a>EXIT STATUS</h4><blockquote>
The following exit values are returned:
<dl compact>

<dt>0<dd>Successful completion.

<dt>&gt;0<dd>An error occurred.

</dl>
</blockquote><h4><a name = "tag_001_014_1094">&nbsp;</a>CONSEQUENCES OF ERRORS</h4><blockquote>
Default.
</blockquote><h4><a name = "tag_001_014_1095">&nbsp;</a>APPLICATION USAGE</h4><blockquote>
Portable applications are warned that in the
<i>Rules</i>
section, an
<i>ERE</i>
without an action is not acceptable, but need not be detected as
erroneous by
<i>lex</i>.
This may result in compilation or run-time errors.
<p>
The purpose of
<i>input()</i>
is to take
characters off the input stream and discard them as far as the lexical
analysis is concerned.
A common use is to discard the body of a comment
once the beginning of a comment is recognised.
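<p>
A hedged sketch of this use follows, assuming C-style comment delimiters.
The loop tests for zero, as described for
<i>input()</i>
in this specification, and also for
<b>EOF</b>,
since some implementations are known to return that value instead.
<pre>
<code>
%{
#include &lt;stdio.h&gt;
%}

%%

"/*"    {
            int c, prev = 0;

            /* Zero marks end-of-file here; some implementations return EOF. */
            while ((c = input()) != 0 &amp;&amp; c != EOF) {
                if (prev == '*' &amp;&amp; c == '/')
                    break;    /* end of comment found */
                prev = c;
            }
        }
</code>
</pre>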
<p>
The
<i>lex</i>
utility is not fully
internationalised in its treatment of regular expressions in the
<i>lex</i>
source code or generated lexical analyser.
It would seem desirable
to have the lexical analyser interpret the regular expressions given in the
<i>lex</i>
source according to the environment
specified when the lexical analyser is executed, but this is not
possible with the current
<i>lex</i>
technology.
Furthermore, the very nature of the lexical analysers produced by
<i>lex</i>
must be closely tied to the lexical requirements of the input language
being described, which will frequently be locale-specific anyway.
(For example, writing an analyser that is used for French text will not
automatically be useful for processing other languages.)
</blockquote><h4><a name = "tag_001_014_1096">&nbsp;</a>EXAMPLES</h4><blockquote>
The following is an example of a
<i>lex</i>
program that implements a rudimentary scanner for a Pascal-like
syntax:
<pre>
<code>
%{
/* need this for the calls to atoi() and atof() below */
#include &lt;stdlib.h&gt;
/* need this for printf(), fopen() and stdin below */
#include &lt;stdio.h&gt;
%}

DIGIT    [0-9]
ID       [a-z][a-z0-9]*

%%

{DIGIT}+    {
    printf("An integer: %s (%d)\n", yytext,
        atoi(yytext));
    }

{DIGIT}+"."{DIGIT}*        {
    printf("A float: %s (%g)\n", yytext,
        atof(yytext));
    }

if|then|begin|end|procedure|function        {
    printf("A keyword: %s\n", yytext);
    }

{ID}    printf("An identifier: %s\n", yytext);

"+"|"-"|"*"|"/"        printf("An operator: %s\n", yytext);

"{"[^}\n]*"}"    /* eat up one-line comments */

[ \t\n]+        /* eat up white space */

.    printf("Unrecognised character: %s\n", yytext);

%%

int main(int argc, char *argv[])
{
    ++argv, --argc;  /* skip over program name */
    if (argc &gt; 0)
        yyin = fopen(argv[0], "r");
    else
        yyin = stdin;

    yylex();
    return 0;
}
</code>
</pre>
</blockquote><h4><a name = "tag_001_014_1097">&nbsp;</a>FUTURE DIRECTIONS</h4><blockquote>
None.
</blockquote><h4><a name = "tag_001_014_1098">&nbsp;</a>SEE ALSO</h4><blockquote>
<i><a href="c89.html">c89</a></i>,
<i><a href="yacc.html">yacc</a></i>.
</blockquote><hr size=2 noshade>
<center><font size=2>
UNIX &reg; is a registered Trademark of The Open Group.<br>

</font></center><hr size=2 noshade>
</body></html>
